Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set the schema case sensitivity to false for Hive #234

Merged
merged 2 commits into from
Mar 4, 2022

Conversation

findinpath
Copy link
Contributor

@findinpath findinpath commented Feb 25, 2022

Problem description

Assume that we have two tables which contain a column with the same name:

    CREATE TABLE test.table_same_column_name_a (some_id string)
    CREATE TABLE test.table_same_column_name_b (some_id string)

Assume also that we have a view doing a select from the result of these table joined together and the name of the common column is used in a different case (some_id, SOME_ID):

   CREATE VIEW IF NOT EXISTS test.view_column_name_case_insensitive AS
       SELECT a.some_id
       FROM test.table_same_column_name_a a
       LEFT JOIN (SELECT trim(some_id) AS SOME_ID FROM test.table_same_column_name_b) b
             ON a.some_id = b.some_id
       WHERE a.some_id != ''

When translating in Coral from Hive, the following result is being obtained:

SELECT some_id
FROM (SELECT table_same_column_name_a.some_id, t.SOME_ID
FROM default.table_same_column_name_a
LEFT JOIN (SELECT TRIM(some_id) SOME_ID, CAST(TRIM(some_id) AS STRING) $f1
FROM default.table_same_column_name_b) t ON table_same_column_name_a.some_id = t.$f1) t0
WHERE t0.some_id <> ''

As it can be seen from the query above this leads to having an ambiguous column, because there is both "some_id" and "SOME_ID" which in Hive point to the same column.

Solution description

Pointing out to coral that Hive is a case insensitive system, solves the problem by creating unique names for the columns in the selection.

...
SELECT table_same_column_name_a.some_id, t.SOME_ID SOME_ID0
...

When joining tables that have the same column name
,irrespective of their case, make sure that the
names of the columns have unique names to avoid
the situation where the SQL statement created
contains ambiguous column names.

@findepi
Copy link
Contributor

findepi commented Feb 25, 2022

cc @wmoustafa

Copy link
Contributor

@wmoustafa wmoustafa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you update the PR description with an example query (preferably simplest one), and the behavior before and after the patch?

Can we add a Hive to Rel conversion test?

Comment on lines 117 to 118
run(driver, "create table table_same_column_name_a (some_id string)");
run(driver, "create table table_same_column_name_b (some_id string)");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please follow the same casing convention as the rest of the test cases? (I see some are lowercase, but these seem to have slipped).

run(driver, "create table table_same_column_name_a (some_id string)");
run(driver, "create table table_same_column_name_b (some_id string)");
run(driver,
"create view if not exists view_column_name_case_insensitive as\n" + " select a.some_id\n"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you simplify the view to the minimal one that satisfies the test case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

create table hive.test.table_a (
  some_id string
);

create table hive.test.table_b (
  some_id string
);

These 2 views will have the same issue:

-- Hive view version 1
create or replace view test.view_ab as
select a.some_id
from   test.table_a a
left join 
(
  select trim(some_id) AS SOME_ID
  from   test.table_b
) b
on    a.some_id = b.some_id
where a.some_id != ''
-- Hive view version 2
create or replace view test.view_ab2 as
select a.some_id
from   test.table_a a
left join 
(
  select some_id AS SOME_ID
  from   test.table_b
) b
on    a.some_id = trim(b.some_id)
where a.some_id != ''

The WHERE clause is needed in order to enforce the usage of the inner select. I don't have at hand a sample view easier to grasp than the ones above.

@@ -114,6 +114,14 @@ public static void initializeViews(HiveConf conf) throws HiveException, MetaExce
run(driver, String.join("\n", "", "CREATE VIEW IF NOT EXISTS named_struct_view", "AS",
"SELECT named_struct('abc', 123, 'def', 'xyz') AS named_struc", "FROM bar"));

run(driver, "create table table_same_column_name_a (some_id string)");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

table_same_column_name_a -> duplicate_column_name.tableA (same for the other table)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did the update for the spark tests.
However the other tests follow a slightly different naming scheme for the test tables. Do you want me to adjust the naming for the other tests (hive/trino) ?

@findinpath findinpath force-pushed the hive-case-insensitive branch 3 times, most recently from 66733bd to 1baab42 Compare February 28, 2022 11:21
@findinpath findinpath requested a review from wmoustafa February 28, 2022 11:23
Copy link
Collaborator

@ljfgem ljfgem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@findinpath Thanks for this PR!
I have kicked off the i-test and will update once it finishes.
Could you please run ./gradlew spotlessApply to fix the code format issue? Please ensure that ./gradlew clean build succeeds before pushing.

@findinpath findinpath force-pushed the hive-case-insensitive branch from 1baab42 to f1527ef Compare February 28, 2022 15:49
Copy link
Collaborator

@ljfgem ljfgem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM and all the i-tests passed.
Thanks @findinpath @findepi !
@wmoustafa @funcheetah do you want to take another look?

In the translation of views, when joining tables
that have the namesake column names ,irrespective of
their case, make sure that the resulting relation
is using names of the columns which have unique names
in order to avoid the situation where the
SQL statement created contains ambiguous column names.
Copy link
Contributor

@funcheetah funcheetah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @findinpath for the PR. LGTM.

@ljfgem ljfgem merged commit a33149e into linkedin:master Mar 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants